An Automated Pipeline for Character and Relationship Extraction from Readers' Literary Book Reviews on Goodreads.com
Reader reviews of literary fiction on social media, especially those in
persistent, dedicated forums, create and are in turn driven by underlying
narrative frameworks. In their comments about a novel, readers generally
include only a subset of characters and their relationships, thus offering a
limited perspective on that work. Yet in aggregate, these reviews capture an
underlying narrative framework comprised of different actants (people, places,
things), their roles, and interactions that we label the "consensus narrative
framework". We represent this framework in the form of an actant-relationship
story graph. Extracting this graph is a challenging computational problem,
which we pose as a latent graphical model estimation problem. Posts and reviews
are viewed as samples of subgraphs/networks of the hidden narrative framework.
Inspired by the qualitative narrative theory of Greimas, we formulate a
graphical generative Machine Learning (ML) model where nodes represent actants,
and multi-edges and self-loops among nodes capture context-specific
relationships. We develop a pipeline of interlocking automated methods to
extract key actants and their relationships, and apply it to thousands of
reviews and comments posted on Goodreads.com. We manually derive the ground
truth narrative framework from SparkNotes, and then use word embedding tools to
compare relationships in ground truth networks with our extracted networks. We
find that our automated methodology generates highly accurate consensus
narrative frameworks: for our four target novels, with approximately 2900
reviews per novel, we report average coverage/recall of important relationships
of > 80% and an average edge detection rate of > 89%. These extracted narrative
frameworks can generate insight into how people (or classes of people) read and
how they recount what they have read to others.
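The aggregation and evaluation steps described in this abstract can be illustrated with a minimal sketch. It assumes that each review has already been reduced to (source actant, relationship, target actant) triples by some upstream extractor; the function names, the `min_reviews` threshold, and the toy data are all hypothetical. The abstract compares relationships with word embedding tools, whereas this sketch uses exact matching of actant pairs as a simplification.

```python
from collections import Counter

def aggregate_reviews(review_triples):
    """Count how many reviews mention each (source, relation, target) triple."""
    support = Counter()
    for triples in review_triples:
        # Using a set means one review contributes at most once per triple.
        support.update(set(triples))
    return support

def consensus_graph(support, min_reviews=2):
    """Keep triples mentioned by at least `min_reviews` reviews (assumed threshold)."""
    return {triple for triple, n in support.items() if n >= min_reviews}

def relationship_recall(extracted, ground_truth):
    """Fraction of ground-truth actant pairs present in the extracted graph."""
    extracted_pairs = {(s, t) for s, _, t in extracted}
    truth_pairs = {(s, t) for s, _, t in ground_truth}
    if not truth_pairs:
        return 0.0
    return len(extracted_pairs & truth_pairs) / len(truth_pairs)

# Toy data, invented for illustration only.
reviews = [
    [("hero", "finds", "ring"), ("mentor", "recruits", "hero")],
    [("hero", "finds", "ring"), ("dragon", "guards", "treasure")],
    [("mentor", "recruits", "hero"), ("hero", "tricks", "dragon")],
]
truth = {
    ("hero", "finds", "ring"),
    ("mentor", "recruits", "hero"),
    ("dragon", "guards", "treasure"),
}

support = aggregate_reviews(reviews)
graph = consensus_graph(support, min_reviews=2)
recall = relationship_recall(graph, truth)
```

Each review contributes only a partial subgraph, and the threshold keeps the relationships that several readers independently mention, which is one simple way to read "consensus" here.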
A Cognition-Driven Approach To Modeling Document Generation and Learning Underlying Contexts From Documents
The development of the Web has, among its other direct influences, provided a vast amount of data to researchers in several disciplines. While in the early stages of its growth the data often went unseen and was secondary to the other products the Internet made available, in the past decade it has quickly become a primary resource for a large number of online applications and has made many analyses and studies possible. Text data in particular has been a cornerstone of these works in an attempt to better understand human knowledge and behavior.
This work focuses on analysis of the process of writing documents and the abstract underlying contexts driving this process. We propose a generative model for documents based on psychological models of human memory search, and from there we define structures that can represent these abstract contexts.
Recent work in the psychology literature suggests that the brain's memory search can be modeled as a random walk on a semantic network (Abbott et al., 2012). The vast body of research available on random walks in different disciplines, and more recently on their use in analyzing the structure of the web and developing search engines, makes this model particularly appealing for understanding and simulating the brain's process of vocabulary selection and document generation. It can also be used to drive lexical applications and automated text analyses such as exploring the inherent structures existing in a language and the relationships between words.
In this work, we present a network approach to describing document generation and discovering contexts. We form an associative network of words based on co-occurrence, with ties between words weighted by the number of documents in the corpus in which they co-occur. By inspecting the hierarchical modularity of this network and using the random walk model and community detection algorithms based on random walks, we can find communities of words that form contextually homogeneous groups. Within a certain context defined by one of these groups, the relative importance of every other word can be determined by creating a contextually biased word association network and using Google's PageRank algorithm, which magnifies nodes with higher centrality. We use these context profiles to form a context-term matrix representative of semantic traces in memory. We then study the hierarchical structure of contextually significant word clusters in different layers of the network by examining layer blocks of the context-term matrix.
Other similar studies include topic modeling, the unsupervised learning of patterns of words and phrases that can represent "topics". The mainstream view in topic modeling regards a topic as a distribution over a known vocabulary. Latent Dirichlet allocation (LDA), for instance (Blei et al., 2003), finds a given number of topics within a text corpus, each topic represented by a distribution over all words. LDA essentially fits a latent variable model of word combinations to a set of observed documents.
We also extend our knowledge structure model to find vector representations of topics that provide summaries of the information contained in the corpus, similar to topic modeling frameworks. These vector representations are calculated by factorization of the context-term matrix. The summary outcome of this method also reveals important sub-structures of the large hierarchical structure. For evaluation, we show that across a variety of datasets, from online forums and tweets to research articles, our summary topics cover, on average, 94% of k=60 LDA topics.
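The network construction and context-biased ranking described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the context word set is supplied by hand instead of coming from random-walk community detection, the "contextual bias" is modeled as a personalized PageRank teleport vector (an assumption), and all function names, parameters, and documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_network(documents):
    """Weight each word pair by the number of documents containing both words."""
    weights = defaultdict(float)
    for doc in documents:
        # Sorting the unique words gives every pair one canonical (u, v) order.
        for u, v in combinations(sorted(set(doc)), 2):
            weights[(u, v)] += 1.0
    return dict(weights)

def personalized_pagerank(weights, context, damping=0.85, iterations=100):
    """Power iteration on the undirected weighted graph, teleporting to context words."""
    neighbors = defaultdict(dict)
    for (u, v), w in weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    nodes = sorted(neighbors)
    seeds = set(context) & set(nodes)
    if not seeds:
        raise ValueError("context must contain at least one word in the network")
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            total = sum(neighbors[u].values())
            for v, w in neighbors[u].items():
                # A walker at u moves to v with probability proportional to w.
                new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank

# Toy corpus, invented for illustration only.
docs = [
    ["vaccine", "doctor", "school"],
    ["vaccine", "doctor"],
    ["novel", "character", "plot"],
    ["novel", "plot"],
    ["doctor", "novel"],
]
weights = cooccurrence_network(docs)
rank = personalized_pagerank(weights, context={"vaccine"})
```

Stacking one such rank vector per context would give rows of a context-term matrix in the sense used above, which could then be factorized to obtain summary topic vectors.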
12th ACM Conference on Web Science
"Mommy Blogs" and the Vaccination Exemption Narrative: Results From A Machine-Learning Approach for Story Aggregation on Parenting Social Media Sites.
Background: Social media offer an unprecedented opportunity to explore how people talk about health care at a very large scale. Numerous studies have shown the importance of websites with user forums for people seeking information related to health. Parents turn to some of these sites, colloquially referred to as "mommy blogs," to share concerns about children's health care, including vaccination. Although substantial work has considered the role of social media, particularly Twitter, in discussions of vaccination and other health care-related issues, there has been little work on describing the underlying structure of these discussions and the role of persuasive storytelling, particularly on sites with no limits on post length. Understanding the role of persuasive storytelling at Internet scale provides useful insight into how people discuss vaccinations, including exemption-seeking behavior, which has been tied to a recent diminution of herd immunity in some communities.
Objective: To develop an automated and scalable machine-learning method for story aggregation on social media sites dedicated to discussions of parenting. We wanted to discover the aggregate narrative frameworks to which individuals, through their exchange of experiences and commentary, contribute over time in a particular topic domain. We also wanted to characterize temporal trends in these narrative frameworks on the sites over the study period.
Methods: To ensure that our data capture long-term discussions and not short-term reactions to recent events, we developed a dataset of 1.99 million posts contributed by 40,056 users and viewed 20.12 million times indexed from 2 parenting sites over a period of 105 months. Using probabilistic methods, we determined the topics of discussion on these parenting sites. We developed a generative statistical-mechanical narrative model to automatically extract the underlying stories and story fragments from millions of posts. We aggregated the stories into an overarching narrative framework graph. In our model, stories were represented as network graphs with actants as nodes and their various relationships as edges. We estimated the latent stories circulating on these sites by modeling the posts as a sampling of the hidden narrative framework graph. Temporal trends were examined based on monthly user-post statistics.
Results: We discovered that discussions of exemption from vaccination requirements are highly represented. We found a strong narrative framework related to exemption seeking and a culture of distrust of government and medical institutions. Various posts reinforced part of the narrative framework graph in which parents, medical professionals, and religious institutions emerged as key nodes, and exemption seeking emerged as an important edge. In the aggregate story, parents used religion or belief to acquire exemptions to protect their children from vaccines that are required by schools or government institutions, but (allegedly) cause adverse reactions such as autism, pain, compromised immunity, and even death. Although parents joined and left the discussion forums over time, discussions and stories about exemptions were persistent and robust to these membership changes.
Conclusions: Analyzing parent forums about health care using an automated analytic approach, such as the one presented here, allows the detection of widespread narrative frameworks that structure and inform discussions. In most vaccination stories from the sites we analyzed, it is taken for granted that vaccines and not vaccine preventable diseases (VPDs) pose a threat to children. Because vaccines are seen as a threat, parents focus on sharing successful strategies for avoiding them, with exemption being the foremost among these strategies. When new parents join such sites, they may be exposed to this endemic narrative framework in the threads they read and to which they contribute, which may influence their health care decision making.
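The idea of treating posts as samples of a hidden narrative framework graph can be illustrated with a small simulation. This is only a sketch under strong assumptions: each post mentions each hidden edge independently with a fixed probability, which is far simpler than the generative statistical-mechanical model named in the abstract. The edge labels loosely paraphrase the aggregate story reported in the Results, and the function names and parameters are hypothetical.

```python
import random
from collections import Counter

def sample_post(hidden_edges, mention_prob, rng):
    """Return the subset of hidden edges that one simulated post mentions."""
    # Sorting makes the draw order, and hence the result, reproducible per seed.
    return {e for e in sorted(hidden_edges) if rng.random() < mention_prob}

def estimate_framework(posts, min_posts=1):
    """Aggregate posts and keep edges mentioned in at least `min_posts` posts."""
    counts = Counter()
    for post in posts:
        counts.update(post)
    return {edge for edge, n in counts.items() if n >= min_posts}

# Hypothetical hidden framework: (actant, relationship, actant) edges.
hidden = {
    ("parent", "seeks", "exemption"),
    ("parent", "distrusts", "institution"),
    ("school", "requires", "vaccine"),
}

rng = random.Random(0)
posts = [sample_post(hidden, 0.3, rng) for _ in range(200)]
estimated = estimate_framework(posts, min_posts=5)
```

Because every simulated post is a subgraph of the hidden graph, the estimate can never contain an edge outside it; how many hidden edges are recovered depends on the mention probability, the number of posts, and the threshold.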